ABSTRACT


The DEC KUV11-AA Writable Control Store was used to
implement selected portions of the U.C.S.D. Pascal
P-machine in firmware. The frequency and execution
speed of P-machine instructions were measured in a
battery of test programs to guide the selection
process. A 32 to 46% reduction in execution time
has been obtained for these test programs.


INTRODUCTION 


The recent release of the KUV11-AA Writable Control
Store (WCS)[1] has given users access to a level
of the LSI-11 previously restricted to a select
few. The product makes a number of interesting
projects possible, such as the application described
here: the microprogramming of portions of Ken
Bowles' U.C.S.D. Pascal P-machine.[2]


The LSI-11 Microprocessor and Writable Control Store 


The LSI-11 is a microprocessor microprogrammed to 
emulate the PDP-11 instruction set. The 
microprocessor has 26 eight-bit registers that are 
addressed by a combination of direct and indirect 
means. Its instruction set is highly vertical, 
i.e., it is not unlike a conventional, albeit 
primitive, machine language. Control is exercised 
by a Translation Array, consisting of four 
programmed logic arrays, that examines the fetched 
PDP-11 level machine instruction and determines 
where microprogram execution is to begin. The 
Translation Array may continue to exert control by 
generating new inputs to the location counter as a 
function of the current value of the location 
counter, interrupt signals, and other control 
inputs. 


The microprocessor memory address space is 2K
words, divided into four 512-word pages. Half of
this address space, or two pages, is used to emulate
the PDP-11. The optional EIS/FIS chip adds a third
page of microcode that emulates an extended PDP-11
instruction set; these additional instructions
include integer multiply and divide plus a battery
of floating-point instructions. The fourth page is
left unused.


The WCS contains a 1K-word (two-page) random access
memory that is primarily intended to serve as the
third and fourth pages of the microprocessor
memory. Use of the WCS as the third page is
restricted by the Translation Array. Normally the
EIS/FIS code resides on this page, and to
facilitate its execution the Translation Array is
programmed to perform various control functions
when execution reaches specific memory locations on
the page. User microprograms must avoid these
locations. These complications can be avoided
entirely by loading the EIS/FIS code into one page
of the WCS and restricting new microprograms to the
other.
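The page arithmetic can be made concrete with a small Python sketch: an 11-bit micro-address (2K words) splits into a 2-bit page number and a 9-bit offset within a 512-word page. The zero-based page numbers, the role labels, and the function names are our own illustrative assumptions, not DEC terminology.

```python
PAGE_SIZE = 512          # words per page of microprocessor memory
NUM_PAGES = 4            # 4 x 512 = 2K words, hence 11-bit micro-addresses

# Illustrative page roles; the zero-based numbering is an assumption.
PAGE_ROLES = {
    0: "PDP-11 emulation",
    1: "PDP-11 emulation",
    2: "EIS/FIS (optional chip, or one WCS page)",
    3: "unused, or one WCS page for user microcode",
}


def split_micro_address(addr: int) -> tuple:
    """Split an 11-bit micro-address into (page, offset within page)."""
    if not 0 <= addr < PAGE_SIZE * NUM_PAGES:
        raise ValueError(f"{addr:#o} is outside the 2K-word address space")
    return addr >> 9, addr & (PAGE_SIZE - 1)    # 2**9 == 512


def supplied_by_wcs(addr: int) -> bool:
    """True if the address lies in the two pages the 1K-word WCS can occupy."""
    page, _ = split_micro_address(addr)
    return page >= 2
```

For example, split_micro_address(0o3000) returns (3, 0), the first word of the fourth page.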


DEC has allocated opcodes 76700-76777 (octal) for user
microprogramming. When the Translation Array 
encounters an opcode in this range, it directs 
control to a single address in the fourth page of 
memory. The user is then responsible for decoding 
the individual opcodes. Use of the Translation 
Array to facilitate execution of user microprograms 
is not supported. 
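Because every user opcode lands at the same micro-address, the user microcode has to recover which of the 64 reserved opcodes was fetched. The bit arithmetic is sketched below in Python as a model only; the function and constant names are hypothetical, and on the LSI-11 the range test itself is made by the Translation Array rather than by software.

```python
from typing import Optional

USER_BASE = 0o076700   # first opcode reserved for user microprogramming
USER_MASK = 0o177700   # 16-bit mask that drops the low six bits


def user_opcode_index(instr: int) -> Optional[int]:
    """Return 0-63 for opcodes 076700-076777 (octal), otherwise None."""
    instr &= 0o177777                      # treat the value as a 16-bit word
    if (instr & USER_MASK) == USER_BASE:
        return instr & 0o77                # which of the 64 user opcodes
    return None
```

With this model, opcode 76704 decodes to index 4, while 76677 falls outside the reserved range and returns None.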

The U.C.S.D. Pascal P-machine 

The U.C.S.D. Pascal system is a complete 
stand-alone system designed to run on micro- and 
minicomputers. One of its most impressive features 
is its use of an underlying P-machine. The 
P-machine is a stack-oriented pseudocomputer that 
exists as an interpreter written in the assembly 
language of the host computer. Pascal source code 
is compiled to an intermediate P-code that is, in 
effect, the assembly language of the P-machine. 
This design makes the system highly portable. The 
operating system itself is written in Pascal. Only 
the relatively small native code interpreter must 
be written to transfer it to a new host. To date 
there have been successful implementations on the 
PDP-11 series and the 8080 family of 
microprocessors, including the 8080A and 8085. 
Other advantages of this type of implementation 
include the efficient use of small memories and 
fast compilation speed. 
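To illustrate what a stack-oriented pseudocomputer does, the toy Python model below executes a few P-code instructions against an evaluation stack. The mnemonics are borrowed from the tables in this paper, but the tuple encoding and the use of unbounded Python integers are simplifications of our own; real P-code is a byte stream with 16-bit arithmetic.

```python
def run(code, globals_):
    """Execute (mnemonic, operand...) tuples on an evaluation stack."""
    stack = []
    for op, *arg in code:
        if op == "SLDC":            # push a small constant
            stack.append(arg[0])
        elif op == "SLDO":          # push the global word at offset arg[0]
            stack.append(globals_[arg[0]])
        elif op == "ADI":           # pop two integers, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "SRO":           # pop the top of stack into a global word
            globals_[arg[0]] = stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack                    # whatever is left on the stack


g = [0, 0, 10]
run([("SLDC", 2), ("SLDC", 3), ("ADI",), ("SRO", 1)], g)
```

The sequence SLDC 2, SLDC 3, ADI, SRO 1 leaves the stack empty and stores 5 in global word 1, the shape of code a stack machine uses for an assignment such as g := 2 + 3.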


PROJECT DEFINITION 


As described above, the execution of the U.C.S.D. 
Pascal P-machine is a two-level process. The 
LSI-11 microprocessor emulates a PDP-11 computer, 
which in turn simulates the P-machine. We are 
exploring the possibility of having the LSI-11 
microprocessor emulate the P-machine directly. The 
primary advantage will be an increase in program 
execution speed. 


One of our initial observations was that there is 
not enough room in the WCS to implement the entire 
P-machine. On the average, it requires more
microcode than macrocode to implement a P-machine 
instruction. One measure put the ratio at 1.7 
words of microcode for every word of macrocode. 

The current LSI-11 macro-level interpreter requires 
2.4K words. A microprogrammed P-machine, we 
estimate, will require 3 to 4K words of microcode 
without the use of the Translation Array. 
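A back-of-the-envelope check shows the size of the gap. Scaling the 2.4K-word macro-level interpreter by the 1.7 ratio gives roughly 4.1K words, the top of the 3 to 4K estimate, while a single 512-word WCS page (the other page holds the EIS/FIS code in this project) could hold only about an eighth of that. The straight multiplication is our simplification; the ratio is one measure and will vary from instruction to instruction.

```latex
\[
  2.4\,\mathrm{K} \times 1.7 \approx 4.1\,\mathrm{K}\ \text{words of microcode},
  \qquad
  \frac{512}{4.1\,\mathrm{K}} \approx \frac{1}{8}
\]
```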


For this project we used one of the pages of the 
WCS to implement portions of the P-machine and the 
other page to contain the EIS/FIS code. Our task 
was to select the best portions to microcode. 


SELECTION OF PORTIONS OF THE P-MACHINE TO MICROCODE


Control Structure 


To execute a P-machine instruction, an opcode must 
be read from macro-level memory, control 
transferred to the appropriate routine for 
execution, and control returned for the next 
instruction fetch. We call this process the
interpreter fetch sequence. In the macro-level
P-machine interpreter it takes approximately
25us for most instructions and accounts for 30 to
45% of the total program execution time, which
makes the interpreter fetch sequence an excellent
candidate for microprogramming.


The interpreter fetch sequence in our micro/macro 
interpreter uses a variation of the scheme used in 
the macro-level interpreter. In this scheme the 
P-machine opcode is used to index a table of 
macro-addresses for the respective routines. The
micro/macro interpreter uses this same table, with
the difference that micro-addresses are also stored
in it. The two kinds of address are distinguished
by setting to ones the high-order four bits of the
words that contain the 11-bit micro-addresses.
Macro-addresses never have these bits set because
the high end of memory is normally reserved for I/O
devices.
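The tagged-table test can be modeled in a few lines of Python. We assume 16-bit table words whose micro-address occupies the low 11 bits; the constants and function names are hypothetical. On a 16-bit PDP-11 address map the I/O page occupies 160000-177777 (octal), which is why a word with all four high-order bits set (170000 or above) cannot be the address of a routine in ordinary memory.

```python
MICRO_TAG = 0o170000       # high-order four bits of a 16-bit word
MICRO_ADDR_MASK = 0o3777   # low 11 bits: a micro-address in the 2K-word space


def micro_entry(micro_addr: int) -> int:
    """Encode an 11-bit micro-address as a tagged dispatch-table word."""
    if not 0 <= micro_addr <= MICRO_ADDR_MASK:
        raise ValueError("micro-address must fit in 11 bits")
    return MICRO_TAG | micro_addr


def classify(entry: int) -> tuple:
    """Return ("micro", address) or ("macro", address) for a table entry."""
    if (entry & MICRO_TAG) == MICRO_TAG:
        return "micro", entry & MICRO_ADDR_MASK
    return "macro", entry


def fetch(code: bytes, pc: int, table: list) -> tuple:
    """One fetch step: read the opcode, look up its routine, advance pc."""
    kind, addr = classify(table[code[pc]])
    return kind, addr, pc + 1
```

With table = [0o040000, micro_entry(0o3000)], opcode 0 dispatches to a macro routine at 040000 and opcode 1 to microcode at 3000.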


The microcoded interpreter fetch sequence has two 
entry points. First, its execution can be 
initiated from macrocode by the opcode 76704. 
Second, after a microcoded P-machine instruction is 
executed, a jump is made directly into the 
interpreter fetch routine. The direct entry from 
microcode is faster than from macrocode. 


The microcoded interpreter fetch sequence averages
approximately 14.5us for most instructions. This
is somewhat disappointing, but it is nevertheless
our most successful single microcoded routine: by
itself it can reduce program execution time by 12
to 19%.
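The size of that saving follows from the two measurements quoted earlier, under the simplifying assumption (ours) that every fetch drops from about 25us to 14.5us and nothing else in the program changes. The Python sketch below applies that Amdahl's-law style estimate to the 30% and 45% fetch shares.

```python
def fetch_only_reduction(fetch_share: float,
                         macro_fetch_us: float = 25.0,
                         micro_fetch_us: float = 14.5) -> float:
    """Fraction of total run time saved by speeding up only the fetch sequence.

    Assumes every fetch drops from macro_fetch_us to micro_fetch_us and that
    nothing else in the program changes.
    """
    return fetch_share * (1.0 - micro_fetch_us / macro_fetch_us)


for share in (0.30, 0.45):
    print(f"fetch share {share:.0%}: saving {fetch_only_reduction(share):.1%}")
```

It gives savings of about 12.6% and 18.9%, in line with the measured range.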


The micro/macro control structure can also handle
microcoded instructions other than 76704. Useful
candidates include general purpose instructions
such as a block move, and specialized instructions
that execute the frequently used parts of a
P-machine instruction, leaving the logic for
handling special cases or error conditions in
macrocode. The scheme for supporting the non-76704
instructions centers on the microinstruction Modify
Instruction (MI). This technique is described in
detail in the WCS User's Guide. Briefly, the MI
instruction uses the fetched macro-level
instruction to index a table of jump instructions
in the microprocessor memory.
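The MI technique can be pictured as a 64-entry jump table indexed by the fetched instruction. In this Python model we assume the low six bits of an 0767xx opcode select the entry; the WCS User's Guide should be consulted for how MI actually forms the address. Slot 4 (opcode 76704) is the interpreter fetch sequence, as in the text, while slot 5 and the trap handler are purely hypothetical.

```python
from typing import Callable, Dict, List

Handler = Callable[[], str]


def trap() -> str:
    """Hypothetical stand-in for an unimplemented user opcode."""
    return "trap: unimplemented user opcode"


def build_jump_table(handlers: Dict[int, Handler]) -> List[Handler]:
    """64-entry table of 'jump instructions', one per user opcode 0767xx."""
    table: List[Handler] = [trap] * 64
    for index, handler in handlers.items():
        table[index] = handler
    return table


def mi_dispatch(instr: int, table: List[Handler]) -> str:
    """Model of MI: the fetched instruction's low six bits select the entry."""
    return table[instr & 0o77]()


table = build_jump_table({
    0o04: lambda: "interpreter fetch sequence",       # opcode 76704
    0o05: lambda: "block move (hypothetical slot)",
})
print(mi_dispatch(0o076704, table))
```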


Another key element of the control structure is the 
handling of interrupts. Periodically, during the 
execution of long microcode routines a check is 
made to see if interrupts are pending. If an 
interrupt has occurred, execution is aborted and 
control returns to macrocode to service the 
interrupt. After the interrupt has been serviced, 
the micro-routine is restarted from the beginning. 
Microcoded routines that may be aborted must be 
careful to postpone making permanent changes until 
after the last interrupt check. 
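The abort-and-restart rule is easiest to see in a small model. The Python sketch below, which is ours rather than the authors' microcode, performs a word-by-word set union into scratch storage, polls a hypothetical interrupt_pending callback every few words, and pushes the result onto the stack only after the last check, so a restart from the beginning is always safe. Equal-length set operands are a simplifying assumption.

```python
from typing import Callable, List


class Aborted(Exception):
    """Signals that a pending interrupt forced the micro-routine to give up."""


def set_union(stack: List[int], a: List[int], b: List[int],
              interrupt_pending: Callable[[], bool], check_every: int = 4) -> None:
    """OR two equal-length set bitmaps word by word, then push the result.

    All work goes into a scratch list; the stack (the permanent state) is only
    touched after the last interrupt check, so an abort leaves it unchanged.
    """
    scratch = []
    for i, (x, y) in enumerate(zip(a, b)):
        if i % check_every == 0 and interrupt_pending():
            raise Aborted                  # nothing has been committed yet
        scratch.append(x | y)
    stack.extend(scratch)                  # commit after the final check


def run_restartable(routine, service_interrupt, *args):
    """Retry the routine from the beginning after each serviced interrupt."""
    while True:
        try:
            return routine(*args)
        except Aborted:
            service_interrupt()            # macrocode services the interrupt
```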


P-machine Instructions 


A series of tests was conducted to determine the
best P-machine instructions to microcode. For
these tests a DEC KWV11-A programmable real-time
clock was used to maintain a running count of the
number of microseconds spent executing each
P-machine instruction, and each instruction was
also counted as it was executed. From these data
the average execution speed, percent of program
execution time, and percent of instruction
execution frequency were calculated for each
instruction. Instructions with a high percentage
of program execution time and/or a high frequency
of execution are prime targets for
microprogramming. Frequency matters because every
microcoded instruction enters the interpreter fetch
sequence directly from microcode, which is faster
than entering it from macrocode.
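The per-instruction figures can be derived from two running totals per instruction. The Python sketch below assumes those totals (executions and microseconds) have already been collected, as the KWV11-A clock did in these tests; the InstrTally name and the dictionary layout are hypothetical, and total_run_us is assumed to be the whole measured run time.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class InstrTally:
    count: int = 0      # times this P-machine instruction was executed
    micros: int = 0     # clock microseconds accumulated while executing it


def summarize(tallies: Dict[str, InstrTally],
              total_run_us: int) -> Dict[str, Tuple[float, float, float]]:
    """Return (average us, % of run time, % of frequency) per instruction."""
    total_count = sum(t.count for t in tallies.values())
    stats = {}
    for name, t in tallies.items():
        if t.count == 0:
            continue                                   # never executed
        stats[name] = (
            t.micros / t.count,                        # average execution speed
            100.0 * t.micros / total_run_us,           # share of execution time
            100.0 * t.count / total_count,             # share of executions
        )
    return stats
```

Sorting the result by the second or third field gives the candidate lists for microcoding.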


Unfortunately, there does not exist a typical 
program that can be tested. Instead a battery of 
test programs was assembled to gain insight into 
commonly used programs. These test programs are as 
follows: 
Compilations - Six programs totalling 3341 lines of
source code were compiled. The programs were
selected at random. Two were written by one of the
authors of this paper, one was WHETSTONE (see
below), and three (XREF, CALC, and RT11TOEDIT) were
selected from the software distributed by U.C.S.D.

WHETSTONE - This program is a synthetic benchmark
developed by H. Curnow and B. Wichmann[3]. It
exercises a computer in a manner considered typical
of scientific programs. Specifically, it includes
array manipulation, conditional jumps, procedure
calls, integer arithmetic, and trigonometric and
other standard functions using real numbers.

Sorts - Three programs, Quicksort (recursive),
Quicksort (nonrecursive), and Heapsort, were used to
sort an identical array of 3,000 reasonably random
integers. The algorithms used were based on those
given by N. Wirth[4].

Miscellaneous exploratory programs - Two programs
were run in the hope of gaining special insights
into the behavior of the P-machine. The first
creates a cross-reference of a Pascal source
program; this XREF program was written at
Sperry-Univac. The second (BTSI) builds a balanced
tree in the heap and conducts searches of that tree.
Again, the algorithm was based on one by N.
Wirth[4].


An example of these test results is presented in
Table 1.

Table 1. Execution speed, percent of execution time, and execution frequency of individual P-machine instructions coded in LSI-11 assembler language. These results were obtained from the compilation of 3341 lines of source code. The average execution speeds are adjusted to eliminate the time for the interpreter fetch sequence. The percent of program execution time includes the interpreter fetch sequence time in the calculation. The interpreter fetch sequence and the single instruction SLDC are in microcode.

                                                 Average      Percentage    Percentage
                                               execution      of program  of execution
Mnemonic Instruction                         speed in us  execution time     frequency
CIP      Call Intermediate Procedure                 632            21.6           2.0
CSP      Call Standard Procedure                    1186            17.5           0.9
FJP      False Jump                                   25     [illegible]           8.7
RNP      Return From Non-base Procedure               75             3.1           2.5
SRO      Store Global Word                            31             2.5           4.7
SLDO     Short Load Global Word, total                12             2.5          12.7
INN      Set Inclusion                               114             2.1           1.1
SLDL     Short Load Local Word, total                 12             1.7           8.4
LDO      Load Global Word                             32             1.6           3.0
CLP      Call Local Procedure                        205             1.6           0.5
EQUI     Integer Comparison, =                        20             1.6           4.5
UJP      Unconditional Jump                           22             1.5           3.9
LDM      Load Multiple Words                          68             1.4           1.2
STL      Store Local Word                             31             1.3           2.5
CXP      Call External Procedure                     487     [illegible]           0.1
XJP      Case Statement                               63             1.1           1.0
ADI      Add Integer                                  10             1.0           6.0
UNI      Set Union                                    92             0.8           0.5
LDB      Load Byte                                    21             0.8           2.3
LAO      Load Global Address                          31             0.7           1.4
SIND     Short Index and Load Word, total             17             0.7           2.4
LLA      Load Local Address                           31             0.7           1.3
SLDO12   Short Load Global Word, offset 12            12             0.7           3.4
SLDC     Short Load Word Constant, total               3             0.6          15.3
IXA      Index Array                                  75             0.5           0.4
SLDO3    Short Load Global Word, offset 3             12             0.5           2.6

The remaining instructions have percents of program execution time of less than 0.5% and percents of total frequency of execution of less than 2.3%.

With this test data, we are able to select portions
of the P-machine to microprogram, code those
portions, and then evaluate the resulting
performance. The ultimate basis for the evaluation
is the percent reduction in program execution time
obtained per word of WCS used and the consistency of
the improvement across the spectrum of test
programs.

RESULTS

To date a page of microcode has been coded. All of
this code was written before the test series
described above was completed, although preliminary
test results were available. Tests using a line
time clock have measured a 32 to 46% reduction in
execution time for the test programs when compared
to the macro-level LSI-11 interpreter (note that both
interpreters use the EIS/FIS code). These results
are shown in Fig. 1.

Fig. 1. Percent reductions in program execution time obtained by the micro/macro interpreter when compared to the macro-level interpreter. [Bar chart; reduction in execution time, %: Whetstone 33.7, Compilations 41.5, Quicksort (R) 45.1, XREF 32.2, BTSI 45.1, Quicksort (NR) 45.6, Heapsort 43.4.]

The page of microcode contains 19 P-machine
instructions and four general purpose instructions.
The speed improvements obtained for the P-machine
instructions and the number of words of WCS required
to code them are given in Table 2. The general
purpose instructions are:
An instruction to retrieve "BIG" operands. 
These operands may be either one or two bytes 
long, depending on whether the sign bit of the 
first byte is set. 


An instruction to traverse down the static 
links of the P-machine stack n levels. 


A block move instruction that increments the 
source and destination addresses as each word 
is moved. 


A block move instruction that decrements the 
source and destination addresses as each word 
is moved. 
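The four general purpose instructions are simple enough to model in Python. The sketch below is illustrative only: the two-byte BIG layout (flag bit masked off, high part in the first byte), the Frame record, and all function names are our assumptions. Offering both copy directions lets a block move handle overlapping source and destination ranges, as C's memmove must, which is one plausible reason for providing the pair.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def read_big(code: bytes, pc: int) -> Tuple[int, int]:
    """Decode a one- or two-byte "BIG" operand; return (value, next pc).

    Assumed layout: sign bit clear -> the byte is the value; sign bit set ->
    its low seven bits are the high part and the next byte is the low part.
    """
    first = code[pc]
    if (first & 0x80) == 0:
        return first, pc + 1
    return ((first & 0x7F) << 8) | code[pc + 1], pc + 2


@dataclass
class Frame:
    name: str
    static_link: Optional["Frame"] = None   # lexically enclosing frame


def follow_static_links(frame: Frame, n: int) -> Frame:
    """Traverse the static links of the stack n levels."""
    for _ in range(n):
        frame = frame.static_link
    return frame


def move_incrementing(mem: List[int], src: int, dst: int, n: int) -> None:
    """Copy n words, lowest address first (safe for overlap when dst <= src)."""
    for i in range(n):
        mem[dst + i] = mem[src + i]


def move_decrementing(mem: List[int], src: int, dst: int, n: int) -> None:
    """Copy n words, highest address first (safe for overlap when dst >= src)."""
    for i in reversed(range(n)):
        mem[dst + i] = mem[src + i]
```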


The microcode produced to date and the detailed 
test results and procedures are available from the 
authors on request. Tests were run using the
LSI-11/2 (KD11-HA) with the MSV11-DD 32K memory.


WHAT CAN BE DONE 


Work is still in progress. Both the selection of 
routines to microcode and the density of the 
microcode can be improved. The current reduction
in execution time, as previously mentioned, is 32 
to 46%. We expect that this figure can be improved 
by several percentage points. A final figure of 
roughly 45 to 55% seems to be possible. When it is 
completed, this code could be programmed into 
read-only memory and made available at a price 
considerably lower than the WCS board. Although 
such a product may not be of widespread interest, 
some installations may find it worthwhile. 


An exciting possibility is the complete conversion 
of the microprocessor to a P-machine. In 
particular, if the Translation Array could be used 
for the interpreter fetch sequence and other 
control functions, speed improvements should far 
outstrip anything that can be accomplished with a 
single page of WCS. Space will still be a major 
problem even with the Translation Array. If this
problem can be overcome, speed improvements by a
factor of four or more do not seem unrealistic.
Table 2. Microcoded P-machine instruction execution times and the number of WCS words used. The timing results were obtained from a test consisting of two compilations, the Whetstone and Quicksort (nonrecursive). Execution speeds are adjusted to eliminate the time for the interpreter fetch sequence. Similarly, microcode for the interpreter fetch sequence is not counted in the "words of WCS" figures. The number of "Words of WCS shared" with other routines is included in the number of "Total words of WCS".

                                                 Micro
Mnemonic Instruction                          time, us
AND      Logical And                                 5
CHK      Check Subrange Bounds                      22
CIP      Call Intermediate Procedure               188
CLP      Call Local Procedure                       75
FJP      False Jump                                  4
GRTI     Integer Comparison, >                      10
LAO      Load Global Address                        10
LDCI     Load Constant Word                          5
LDM      Load Multiple Words                        22
LDO      Load Global Word                          10*
LEQI     Integer Comparison, ≤                      11
LLA      Load Local Address                         9*
NEQI     Integer Comparison, ≠                       8
RNP      Return From Non-base Procedure             24
SLDC     Short Load Word Constant                    3
SRO      Store Global Word                         10*
STL      Store Local Word                            9
UJP      Unconditional Jump                          6
XJP      Case Statement                             16

* Figure is an estimate.


(1) 


(2) 


(3) 


(4) 